What do we mean by data wrangling?
Lots of possible definitions, but generally refers to all the transformations you might need to do to get data into a form that is ready for further analysis.
Create a new column based on calculations in your data
Select only certain parts of your data
Do calculations on subsets of your data
Modify text data
Iterate over your data
Merge data from multiple sources
Create a new column based on calculations in your data: mutate()
Select only certain columns or rows of your data: filter(), select()
Do calculations on subsets of your data: group_by(), summarise()
Modify text data: stringr functions
Iterate over your data: for loops, apply(), purrr::map()
Merge data from multiple sources: join()
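As a sketch (toy data on the built-in mtcars dataset, not the workshop's UFO data), several of the mapped functions chain together in one pipeline:

```r
library(dplyr)

# Hypothetical example: mutate a new column, filter rows,
# then summarise within groups.
mtcars %>%
  mutate(kpl = mpg * 0.425) %>%    # create a new column (km per liter)
  filter(cyl == 4) %>%             # keep only 4-cylinder cars
  group_by(gear) %>%               # split by gear count
  summarise(mean_kpl = mean(kpl))  # one summary row per group
```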
I just downloaded it for the first time.
I use it, but it makes me uncomfortable.
I’m comfortable with it, here to learn about the tidyverse.
No idea what you’re talking about.
I’ve heard of and used these things, but they’re a stretch.
Easy peasy.
A collection of R packages designed for data wrangling and visualization, built for tidying your data and working with tidy data.
library(tidyverse)
Elegant, consistent way of organizing data
First argument to each function is typically a data frame (consistency)
More like reading English than base R code
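For comparison (a toy example on mtcars, not from the workshop), the same subset written in base R and in dplyr:

```r
library(dplyr)

# Base R: indexing, with the data frame name repeated
mtcars[mtcars$cyl == 4 & mtcars$mpg > 30, c("mpg", "wt")]

# dplyr: reads left to right, data frame named once
mtcars %>%
  filter(cyl == 4, mpg > 30) %>%
  select(mpg, wt)
```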
Each variable forms a column.
Each observation forms a row.
Each type of observational unit forms a table. (https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html)
An untidy example:
| names | favorite_food | favorite_color |
|---|---|---|
| Adrienne | burritos | yellow |
| Alex | pizza | turquoise |
Tidied:
| names | variable | value |
|---|---|---|
| Adrienne | favorite_food | burritos |
| Alex | favorite_food | pizza |
| Adrienne | favorite_color | yellow |
| Alex | favorite_color | turquoise |
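The tidying above can be done in code. A minimal sketch with tidyr's pivot_longer() (an assumption; the workshop doesn't show this step):

```r
library(tidyr)

# Rebuild the untidy example table
df <- data.frame(
  names          = c("Adrienne", "Alex"),
  favorite_food  = c("burritos", "pizza"),
  favorite_color = c("yellow", "turquoise")
)

# One row per (name, variable) pair
pivot_longer(df, cols = -names,
             names_to = "variable", values_to = "value")
```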
Sometimes messy can be good!
Data entry
Data with matrix operations (e.g., statistical modeling)
Nice to know your options
Read data using read_csv().
ufo <- read_csv("../data/ufo_dat.csv")
## Parsed with column specification:
## cols(
## datetime = col_character(),
## city = col_character(),
## state = col_character(),
## country = col_character(),
## shape = col_character(),
## `duration (seconds)` = col_integer(),
## `duration (hours/min)` = col_character(),
## comments = col_character(),
## `date posted` = col_character(),
## latitude = col_double(),
## longitude = col_double()
## )
## Warning in rbind(names(probs), probs_f): number of columns of result is not
## a multiple of vector length (arg 1)
## Warning: 266 parsing failures.
## # A tibble: 5 x 5
##     row col                expected               actual file
##   <int> <chr>              <chr>                  <chr>  <chr>
## 1  1606 duration (seconds) no trailing characters .5     '../data/ufo_dat…
## 2  1653 duration (seconds) no trailing characters .5     '../data/ufo_dat…
## 3  1660 duration (seconds) no trailing characters .1     '../data/ufo_dat…
## 4  1683 duration (seconds) no trailing characters .5     '../data/ufo_dat…
## 5  2039 duration (seconds) no trailing characters .05    '../data/ufo_dat…
## See problems(...) for more details.
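One way to silence these failures (an assumption, not shown in the workshop) is to declare the troublesome column's type up front, so ".5"-style values parse as doubles instead of failing as integers. A self-contained sketch with inline data:

```r
library(readr)

# read_csv() guesses column types from the first rows; later values like
# ".5" then fail to parse as integer. col_double() avoids the guess.
# I() marks the string as literal data (readr >= 2.0).
txt <- I("duration\n20\n.5")
read_csv(txt, col_types = cols(duration = col_double()))
```

For the workshop data the same idea would apply to the `duration (seconds)` column.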
ufo1 <- read.csv("../data/ufo_dat.csv")
Many structures work, but you should definitely have one.
Consider a “recipe” script.
Avoid hard-coded, machine-specific paths:
setwd()
read.csv("adrienne/cool_project1/version6/sdfljk.csv")
read_csv() / read.csv()
system.time(ufo <- read_csv("../data/ufo_dat.csv"))
## user system elapsed
## 0.525 0.023 0.553
system.time(ufo1 <- read.csv("../data/ufo_dat.csv"))
## user system elapsed
## 2.722 0.038 2.768
read_csv() is “lazy and surly” (Wickham).
read.csv(): stringsAsFactors = TRUE, uses rownames
View(ufo)
The pipe %>% passes the output of one function along as the first argument to the next (e.g., head(), mean()).
ufo <- ufo %>%
rename("duration_sec" = `duration (seconds)`)
This is the same as:
ufo <- rename(ufo, "duration_sec" = `duration (seconds)`)
mutate()
ufo <- ufo %>%
mutate(duration_min = duration_sec/60) %>%
mutate(duration_hr = duration_min/60)
kable(head(ufo[, c("duration_sec", "duration_min", "duration_hr")]))
| duration_sec | duration_min | duration_hr |
|---|---|---|
| 2700 | 45.0000000 | 0.7500000 |
| 7200 | 120.0000000 | 2.0000000 |
| 20 | 0.3333333 | 0.0055556 |
| 20 | 0.3333333 | 0.0055556 |
| 900 | 15.0000000 | 0.2500000 |
| 300 | 5.0000000 | 0.0833333 |
ufo <- ufo %>%
mutate(duration_min = duration_sec/60) %>%
mutate(duration_hr = duration_min/60)
In tidyverse functions, column names typically don’t get quotes.
?mutate
Challenge: Add a column that gives the duration in years. (Hint: there are 8760 hours in a year.)
ufo <- ufo %>%
mutate(duration_years = duration_hr/8760)
filter(), select()
ufo %>%
filter(duration_hr > 10) %>%
select(datetime, duration_sec, duration_hr, `duration (hours/min)`) %>%
arrange(desc(duration_hr)) %>%
head()
## # A tibble: 6 x 4
## datetime duration_sec duration_hr `duration (hours/min)`
## <chr> <int> <dbl> <chr>
## 1 10/1/1983 17:00 97836000 27177. 31 years
## 2 6/3/2010 23:30 82800000 23000. 23000hrs
## 3 9/15/1991 18:00 66276000 18410. 21 years
## 4 4/2/1983 24:00 52623200 14618. 2 months
## 5 8/10/2012 21:00 52623200 14618. 2 months
## 6 8/24/2002 01:00 52623200 14618. 2 months
31 years?!
comment <- ufo %>%
filter(duration_sec > 90000000) %>%
select(comments)
print(comment$comments)
## [1] "Firstly, I was stunned and stared at the object for what seemed minutes, but probably was only seconds. My first inclination was to bec"
Challenge: Get only the observations shorter than a duration of your choice. Save the results to a new data frame called short_sight.
short_sight <- ufo %>% filter(duration_hr < 1)
group_by() %>% summarise()
For example, how many sightings were there per state?
n_per_state <- ufo %>%
group_by(state) %>%
count(sort = T)
kable(head(n_per_state))
| state | n |
|---|---|
| ca | 9655 |
| NA | 5797 |
| wa | 4268 |
| fl | 4200 |
| tx | 3677 |
| ny | 3219 |
Another example
mean_lon <- ufo %>%
group_by(state) %>%
summarise(lon = mean(longitude))
kable(head(mean_lon))
| state | lon |
|---|---|
| ab | -111.69953 |
| ak | -148.16861 |
| al | -85.60044 |
| ar | -92.45776 |
| az | -111.85600 |
| bc | -112.79980 |
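summarise() can return several summaries at once; a toy sketch (data invented for illustration):

```r
library(dplyr)

toy <- data.frame(state = c("ca", "ca", "wa"),
                  lon   = c(-120, -122, -121))

toy %>%
  group_by(state) %>%
  summarise(n        = n(),        # count per group
            mean_lon = mean(lon))  # mean per group
```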
for loops, apply(), purrr::map()
When to iterate:
Anytime you need to do something more than twice (ish)
for loops
Advantages: many other programming languages also have for loops.
Disadvantages: can be bulky and/or slow compared to other options.
for loops
How many UFOs were observed in each state?
states <- unique(ufo$state)
answer_df <- data.frame(state = states, n = NA)
for(i in seq_along(states)){
temporary_data <- ufo %>% filter(state == states[i])
answer_df$n[i] <- nrow(temporary_data)
}
kable(head(answer_df))
| state | n |
|---|---|
| tx | 3677 |
| NA | 0 |
| hi | 353 |
| tn | 1193 |
| ct | 968 |
| al | 691 |
Any differences from our group_by() %>% summarise() approach?
apply functions
How many UFOs were observed in each state?
states <- unique(ufo$state)
count_obs <- function(state_id){
nrow(ufo %>% filter(state == state_id))
}
counts <- lapply(states, count_obs)
counts[1:5]
## [[1]]
## [1] 3677
##
## [[2]]
## [1] 0
##
## [[3]]
## [1] 353
##
## [[4]]
## [1] 1193
##
## [[5]]
## [1] 968
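Note that lapply() always returns a list; unlist() (or vapply()) collapses it to a plain vector. A toy sketch:

```r
# lapply() returns a list even when each element is a single number
counts <- lapply(1:3, function(i) i^2)

# unlist() flattens the list to a plain vector
unlist(counts)
```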
map functions
This is the tidyverse way…
How many UFOs were observed in each state?
states <- unique(ufo$state)
count_obs <- function(state_id){
nrow(ufo %>% filter(state == state_id))
}
ans_list <- map(states, count_obs)
ans_vect <- map_int(states, count_obs)
map
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
models <- mtcars %>%
split(.$cyl) %>%
map(function(df) lm(mpg ~ wt, data = df))
** Example stolen directly from R for Data Science: http://r4ds.had.co.nz/iteration.html#the-map-functions
A list of three linear models…
str(models, max.level = 1)
## List of 3
## $ 4:List of 12
## ..- attr(*, "class")= chr "lm"
## $ 6:List of 12
## ..- attr(*, "class")= chr "lm"
## $ 8:List of 12
## ..- attr(*, "class")= chr "lm"
summary(models[[1]])
##
## Call:
## lm(formula = mpg ~ wt, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.1513 -1.9795 -0.6272 1.9299 5.2523
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.571 4.347 9.104 7.77e-06 ***
## wt -5.647 1.850 -3.052 0.0137 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.332 on 9 degrees of freedom
## Multiple R-squared: 0.5086, Adjusted R-squared: 0.454
## F-statistic: 9.316 on 1 and 9 DF, p-value: 0.01374
names(models[[1]])
## [1] "coefficients" "residuals" "effects" "rank"
## [5] "fitted.values" "assign" "qr" "df.residual"
## [9] "xlevels" "call" "terms" "model"
map(models, "coefficients")
## $`4`
## (Intercept) wt
## 39.571196 -5.647025
##
## $`6`
## (Intercept) wt
## 28.408845 -2.780106
##
## $`8`
## (Intercept) wt
## 23.868029 -2.192438
** Caveat: May not be a good idea statistically…
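To pull a single number per model, map_dbl() returns a plain vector instead of a list. A sketch of the same pipeline without the pipe (coef() is base R):

```r
library(purrr)

# Refit the three models, then extract just the wt slope from each
models <- map(split(mtcars, mtcars$cyl),
              function(df) lm(mpg ~ wt, data = df))
slopes <- map_dbl(models, function(m) coef(m)[["wt"]])
slopes  # named numeric vector: one slope per cylinder group
```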
join()
Is the number of UFO sightings per year associated with economic conditions?
join()
First, get economic data: ?economics
kable(head(economics))
| date | pce | pop | psavert | uempmed | unemploy |
|---|---|---|---|---|---|
| 1967-07-01 | 507.4 | 198712 | 12.5 | 4.5 | 2944 |
| 1967-08-01 | 510.5 | 198911 | 12.5 | 4.7 | 2945 |
| 1967-09-01 | 516.3 | 199113 | 11.7 | 4.6 | 2958 |
| 1967-10-01 | 512.9 | 199311 | 12.5 | 4.9 | 3143 |
| 1967-11-01 | 518.1 | 199498 | 12.5 | 4.7 | 3066 |
| 1967-12-01 | 525.8 | 199657 | 12.1 | 4.8 | 3018 |
join()
Make a year column in the UFO dataset. This requires a brief detour into lubridate.
library(lubridate)
ufo <- ufo %>%
separate(datetime, into = c("date", "time"), sep = " ") %>%
mutate(date = mdy(date)) %>%
mutate(year = year(date))
join()
Let’s see what we got out of that:
ufo %>%
select(date, time, year) %>%
head() %>%
kable()
| date | time | year |
|---|---|---|
| 1949-10-10 | 20:30 | 1949 |
| 1949-10-10 | 21:00 | 1949 |
| 1955-10-10 | 17:00 | 1955 |
| 1956-10-10 | 21:00 | 1956 |
| 1960-10-10 | 20:00 | 1960 |
| 1961-10-10 | 19:00 | 1961 |
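The lubridate steps in isolation (toy dates mirroring the format above):

```r
library(lubridate)

# mdy() parses "month/day/year" strings into Date objects
d <- mdy(c("10/10/1949", "6/3/2010"))

# year() extracts the year component as a number
year(d)
```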
join()
We’ll need to know how many UFO sightings there were each year in our dataset.
Challenge: how would you do this?
for loops
apply functions
map functions
group_by() %>% summarise()
join()
ufo_counts <- ufo %>%
group_by(year) %>%
count()
kable(head(ufo_counts))
| year | n |
|---|---|
| 1906 | 1 |
| 1910 | 2 |
| 1916 | 1 |
| 1920 | 1 |
| 1925 | 1 |
| 1929 | 1 |
join()
We’ll also want economic summaries by year.
econ_summary <- economics %>%
mutate(year = year(date)) %>%
group_by(year) %>%
summarise_all(mean) %>%
ungroup() %>%
select(-date)
join()
df_new <- left_join(ufo_counts, econ_summary, by = "year") %>%
filter(year >= min(econ_summary$year))
kable(head(df_new))
| year | n | pce | pop | psavert | uempmed | unemploy |
|---|---|---|---|---|---|---|
| 1967 | 188 | 515.1667 | 199200.3 | 12.30000 | 4.700000 | 3012.333 |
| 1968 | 220 | 557.4583 | 200663.8 | 11.21667 | 4.500000 | 2797.417 |
| 1969 | 155 | 604.4833 | 202648.7 | 10.74167 | 4.441667 | 2830.167 |
| 1970 | 147 | 647.6917 | 204982.3 | 12.60833 | 4.983333 | 4127.333 |
| 1971 | 130 | 701.0000 | 207589.3 | 13.25833 | 6.275000 | 5021.667 |
| 1972 | 158 | 769.4333 | 209837.6 | 12.11667 | 6.108333 | 4875.833 |
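left_join() keeps every row of its first argument and fills unmatched rows with NA. A toy sketch (numbers invented for illustration):

```r
library(dplyr)

a <- data.frame(year = c(1967, 1968), n   = c(188, 220))
b <- data.frame(year = c(1968, 1969), pce = c(557, 604))

# All rows of `a` survive; 1967 has no match in `b`, so pce becomes NA
left_join(a, b, by = "year")
```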
melt
What if we wanted to plot all these economic variables at the same time?
melt
library(reshape2)
df_long <- df_new %>%
melt(id.vars = c("year", "n"))
kable(head(df_long))
| year | n | variable | value |
|---|---|---|---|
| 1967 | 188 | pce | 515.1667 |
| 1968 | 220 | pce | 557.4583 |
| 1969 | 155 | pce | 604.4833 |
| 1970 | 147 | pce | 647.6917 |
| 1971 | 130 | pce | 701.0000 |
| 1972 | 158 | pce | 769.4333 |
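tidyr's pivot_longer() does the same reshaping without reshape2 (a hedged alternative, shown here on toy data):

```r
library(tidyr)

toy <- data.frame(year = c(1967, 1968), n = c(188, 220),
                  pce  = c(515, 557),   pop = c(199200, 200664))

# Everything except year and n becomes (variable, value) pairs
pivot_longer(toy, cols = -c(year, n),
             names_to = "variable", values_to = "value")
```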
melt
ggplot(df_long, aes(x = value, y = n)) +
geom_point() +
geom_smooth() +
facet_wrap(~variable, scales = "free_x")
stringr functions
Which UFO observations refer to a color?
ufo <- ufo %>% mutate(comments = tolower(comments))
colors <- c(" red ", " orange ", " yellow ", " green ", " blue ", " purple ", " pink ", " white ", " black ")
ufo_colors <- ufo %>%
mutate(color_ans = str_detect(comments, paste(colors, collapse = "|"))) %>%
filter(color_ans == T)
Why the spaces? Any potential problems with spaces?
ufo_colors$comments[1:10]
## [1] "a bright orange color changing to reddish color disk/saucer was observed hovering above power transmission lines."
## [2] "round , orange , with what i would say was polished metal of some kind around the edges ."
## [3] "silent red /orange mass of energy floated by three of us in western north carolina in the 60s"
## [4] "oh, what a night ! two (2) saucer-shaped, glowing green objects and one (1) brilliantly glowing sphere gliding over the lake."
## [5] "a red glowing sphere stopped and watched me."
## [6] "three light in the sky that led to a big black silent triangle shaped craft."
## [7] "football field sized chevron with bright white lights on front, moving slowly with absolutely no sound"
## [8] "a silent black triangular object with no lights moved over us as we walked our dog across a school playground."
## [9] "1 object with green and red lights"
## [10] "man on hwy 43 sw of milwaukee sees large, bright blue light streak by his car, descend, turn, cross road ahead, strobe. bizarre!"
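The padded spaces keep " red " from matching inside words like "covered", but they also miss colors at the start or end of a comment or next to punctuation. A word-boundary regex is one alternative (an assumption, not from the workshop):

```r
library(stringr)

x <- c("a red light", "red light", "bright red.")

# Space-padded pattern misses the 2nd and 3rd comments
str_detect(x, " red ")

# \\b matches word boundaries, including string ends and punctuation
str_detect(x, "\\bred\\b")
```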
colors_ans <- map(colors, function(color){str_extract(ufo_colors$comments, color)})
colors_df <- as.data.frame(colors_ans)
colors_df <- colors_df %>%
unite("colors_included", 1:ncol(colors_df), sep = ";") %>%
mutate(colors_included = str_replace_all(colors_included, "NA;", "")) %>%
mutate(colors_included = str_replace_all(colors_included, ";NA", ""))
head(colors_df)## colors_included
## 1 orange
## 2 orange
## 3 red
## 4 green
## 5 red
## 6 black
ufo_colors <- ufo_colors %>% mutate(colors = colors_df$colors_included)
“You never get better at regex; you just get better at googling.”
What colors are most common in UFOs, and is there a relationship between color and shape?
library(tidytext)
colors_df <- ufo_colors %>%
unnest_tokens(output = colors, input = colors,
token = stringr::str_split, pattern = ";") %>%
mutate(colors = str_replace_all(colors, " ", ""))
colors_df %>%
select(date, colors) %>%
head() %>%
kable()
| date | colors |
|---|---|
| 1965-10-10 | orange |
| 1968-10-10 | orange |
| 1968-10-10 | red |
| 1973-10-10 | green |
| 1980-10-10 | red |
| 1985-10-10 | black |
colors_df %>%
group_by(colors) %>%
count(sort = T) %>%
ggplot(aes(x = colors, y = n)) +
geom_col() +
coord_flip()but red and orange show up a lot too!
First, we should just get the 5 most common shapes.
Challenge: How would you do that?
shapes <- colors_df %>%
filter(!is.na(shape)) %>%
group_by(shape) %>%
count(sort = TRUE) %>%
ungroup() %>%
slice(1:5)
kable(shapes)
| shape | n |
|---|---|
| light | 4941 |
| circle | 2673 |
| triangle | 2242 |
| fireball | 2082 |
| sphere | 1727 |
What’s wrong with this?
Challenge: How would you do that?
shape_colors <- colors_df %>%
filter(shape %in% shapes$shape)
ggplot(shape_colors, aes(x = colors, fill = shape)) +
geom_bar(position = "fill")
What’s another question we could ask of this data?
What are the programming steps we would need to answer it?
What do each of these functions do?
mutate()
filter()
select()
group_by() %>% summarise()
str_detect() and str_extract()
join()
What are some approaches to iteration?
for loops
apply() functions
map() functions
in simple cases, may be replaceable with group_by() %>% summarise()
R for Data Science: http://r4ds.had.co.nz/
tidyverse web site: https://www.tidyverse.org/
Thanks!